Pattern classification with missing data 
using belief functions 

Zhun-ga Liu a,b, Quan Pan a, Gregoire Mercier b, Jean Dezert c

a. School of Automation, Northwestern Polytechnical University, Xi’an, China. Email: liuzhunga@gmail.com 

b. Telecom Bretagne, CNRS UMR 6285 Lab-STICC/CID, Brest, France. Email: Gregoire.Mercier@telecom-bretagne.eu

c. ONERA - The French Aerospace Lab, F-91761 Palaiseau, France. Email: jean.dezert@onera.fr



Abstract — The missing data in incomplete pattern can have 
different estimations, and the classification result of the pattern may differ markedly from one estimation to another. Such uncertainty (ambiguity) of classification is mainly caused by the loss of information in the missing data. A new prototype-based credal classification (PCC) method is proposed to classify incomplete patterns using belief functions. The class prototypes obtained from the training data are each used to estimate the missing values. In a c-class problem, one has to deal with c prototypes, which yield c estimations. The edited patterns based on each possible estimation are then classified by a standard classifier, giving c classification results for an incomplete pattern. Because all these classification results are potentially admissible, they are fused to obtain the credal classification of the incomplete pattern. A new credal combination method is introduced for solving the classification problem; it is able to characterize the inherent uncertainty due to the possibly conflicting results delivered by the different estimations of the missing data. The incomplete patterns that are hard to classify correctly are committed by PCC to proper meta-classes in order to reduce the misclassification rate. The use and potential of the PCC method are illustrated through several experiments with artificial and real data sets.

Index Terms — belief functions, evidence theory, missing data, 
data classification, fusion rule 

I. Introduction 

The classification of incomplete patterns with missing values is an important topic in the field of machine learning. Many methods [1] have emerged for classifying incomplete patterns; they mainly concern the handling of missing values and the pattern classification itself. The simplest method just deletes the incomplete patterns [2], and the classifier is applied only to the complete patterns. A model of the probability density function (pdf) of the whole data set is also sometimes derived for classification based on Bayes decision theory [3]. Some classifiers [4] specifically designed to deal with incomplete data without estimating the missing values have also been developed. In many cases the imputation strategy [5] is adopted for the missing values, and the edited patterns with estimated values are then classified. A number of imputation methods have been introduced, and they can generally be grouped into two types [1]. The first type is statistical imputation, including mean imputation, regression imputation, multiple imputation, hot deck imputation, and so on. In particular, in the mean imputation (MI) method [6], the missing values are replaced by the mean of the known values of that attribute. The second type is imputation based on machine learning, which includes K-nearest neighbor imputation (KNNI), SOM imputation, etc. In the often-used KNNI method [7], the missing values are estimated using the K nearest neighbors of the object (incomplete pattern).

The missing data can have several possible estimated values, and the classification results of the incomplete pattern (test sample) obtained with different estimations can sometimes be very different. For example, an object may be assigned to class A with the highest probability under one estimation of the missing data, but to class B, with $A \cap B = \emptyset$, under another estimation. Such conflict (uncertainty) of classification is caused by the lack of information about the missing (unknown) values, and it is really hard to classify the object correctly in this situation because the known (available) attribute information is insufficient for making a specific classification. The belief function framework introduced by Shafer [8]-[10] in Dempster-Shafer theory (DST) is appealing for dealing with such uncertain and imprecise information [11]. Belief functions have already been used in many fields, such as data classification [12]-[16], data clustering [17]-[20], and decision-making [21]. Some data classification methods [16] have been developed based on DST. A K-nearest neighbors rule based on DST is proposed in [13], and a neural network classifier working with DST is presented in [14]. In the aforementioned methods, the meta-classes defined by the disjunction of several specific classes (i.e. the partially ignorant classes) are not considered as potential solutions of the classification. In our very recent work, a new belief K-nearest neighbor (BK-NN) classifier [15] working with credal classification has been presented to deal with uncertain data by considering all possible meta-classes in the classification process, because the meta-classes are truly useful and important to represent the imprecision of the classification. Nevertheless, these classification methods working with belief functions were all designed for classifying complete patterns only, and the missing data aspect was not taken into account.

In this work, a new prototype-based credal classification (PCC) method (see footnote 1) is proposed for the classification of incomplete patterns under the belief function framework. An object that is hard to classify correctly because of the uncertainty (imprecision) caused by the missing values is committed to the proper meta-class, defined by the union (disjunction) of the specific classes (e.g. $A \cup B$) that the object likely belongs to. This approach allows us both to reduce the misclassification error rate and to reveal the imprecision of the classification. This paper is organized as follows. After a brief introduction to the basics of evidential reasoning in section II, the new prototype-based credal classification method is presented in section III. The proposed PCC method is then tested in section IV and compared with two other classical methods, followed by conclusions.

1 The estimation of missing data in this new method is based on the prototypes of the classes.

II. Brief recall of evidence theory 

The belief functions were introduced by Shafer in his original Mathematical Theory of Evidence [8]-[10]. This theory is also classically known as the Evidential Reasoning (ER) approach, or as Dempster-Shafer Theory (DST). In this theory, one starts with a frame of discernment $\Omega = \{\omega_1, \ldots, \omega_i, \ldots, \omega_c\}$ consisting of a finite discrete set of mutually exclusive and exhaustive hypotheses (classes). The power-set of $\Omega$, denoted $2^\Omega$, is the set of all the subsets of $\Omega$. For example, if $\Omega = \{\omega_1, \omega_2, \omega_3\}$, then $2^\Omega = \{\emptyset, \omega_1, \omega_2, \omega_3, \omega_1 \cup \omega_2, \omega_1 \cup \omega_3, \omega_2 \cup \omega_3, \Omega\}$. A singleton class (e.g. $\omega_i$) is called a specific class. The disjunctions (unions) of several single classes, which represent the partial ignorances in $2^\Omega$ (e.g. $\omega_i \cup \omega_j$, or $\omega_i \cup \omega_j \cup \omega_k$, etc.), are called meta-classes.

A basic belief assignment (BBA) is a function $m(.)$ from $2^\Omega$ to $[0, 1]$ satisfying $\sum_{A \in 2^\Omega} m(A) = 1$ and $m(\emptyset) = 0$. The subsets $A$ of $\Omega$ such that $m(A) > 0$ are called the focal elements of $m(.)$. The credal classification (partition) [17], [18] is defined as the n-tuple $M = (m_1, \ldots, m_n)$, where $m_i$ is the basic belief assignment of the object $x_i \in X$, $i = 1, \ldots, n$, associated with the different elements of the power-set $2^\Omega$. The mass of belief of a meta-class reflects the imprecision (ambiguity) degree of the classification of the uncertain data. The lower and upper bounds of the imprecise probability associated with a BBA correspond to the belief function $Bel(.)$ and the plausibility function $Pl(.)$ [8]. They are given for all $A \in 2^\Omega$ by

$$Bel(A) = \sum_{B \subseteq A} m(B) \qquad (1)$$

$$Pl(A) = \sum_{B \cap A \neq \emptyset} m(B) \qquad (2)$$
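As an illustration of Eqs. (1) and (2), the following minimal Python sketch computes the belief and plausibility of a subset for a small BBA. The dictionary-of-frozensets representation, the function names `bel` and `pl`, and the numerical masses are assumptions made for this illustration only; they do not come from the paper.

```python
# Assumed representation: a BBA is a dict mapping a frozenset of class
# labels (a focal element) to its mass of belief.

def bel(m, A):
    """Belief of A: total mass of the non-empty focal elements included in A (Eq. 1)."""
    return sum(mass for B, mass in m.items() if B and B <= A)


def pl(m, A):
    """Plausibility of A: total mass of the focal elements intersecting A (Eq. 2)."""
    return sum(mass for B, mass in m.items() if B & A)


OMEGA = frozenset({"w1", "w2", "w3"})

# Illustrative masses (not from the paper): one specific class,
# one meta-class and the full ignorance.
m = {
    frozenset({"w1"}): 0.5,
    frozenset({"w1", "w2"}): 0.3,  # meta-class w1 U w2
    OMEGA: 0.2,                    # total ignorance
}

A = frozenset({"w1", "w2"})
print("Bel(w1 U w2) =", bel(m, A))
print("Pl(w1 U w2)  =", pl(m, A))
print("Bel(w2) =", bel(m, frozenset({"w2"})), " Pl(w2) =", pl(m, frozenset({"w2"})))
```

In this sketch the interval [Bel(A), Pl(A)] plays the role of the lower and upper probability bounds mentioned above: the class w2 receives no direct support, yet it remains plausible through the meta-class and the ignorance masses.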

Bel(.) and Pl(.) can be used for decision-making support 
when adopting pessimistic or optimistic attitudes if necessary. 

In the DST framework, Shafer proposed that the different pieces of evidence represented by BBAs should be combined using Dempster's rule [8], commonly denoted DS rule in the literature and represented by the symbol $\oplus$. Mathematically, the DS rule of combination of two BBAs $m_1(.)$ and $m_2(.)$ defined on $2^\Omega$ is defined by $m_{DS}(\emptyset) = 0$ and, for $A \neq \emptyset$ and $B, C \in 2^\Omega$, by

$$m_{DS}(A) = [m_1 \oplus m_2](A) = \frac{\sum_{B \cap C = A} m_1(B) m_2(C)}{1 - \sum_{B \cap C = \emptyset} m_1(B) m_2(C)} \qquad (3)$$

In DS rule, the total conflicting belief mass is redistributed back to all the focal elements through a classical normalization step. However, it is known that DS rule can produce very unreasonable results not only in highly conflicting cases, but also in some very special low-conflict cases [23], [24], which is why many other combination rules [25] have been developed to overcome its limitations.
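The combination rule of Eq. (3) can be sketched in a few lines of Python. This is a minimal illustration, assuming a BBA is stored as a dictionary mapping frozensets to masses; the function name `dempster` and the two example BBAs are hypothetical and are not taken from the paper.

```python
from itertools import product


def dempster(m1, m2):
    """Combine two BBAs (dict: frozenset -> mass) with Dempster's rule (Eq. 3)."""
    combined, conflict = {}, 0.0
    for (B, mb), (C, mc) in product(m1.items(), m2.items()):
        inter = B & C
        if inter:
            # Product of masses goes to the intersection of the focal elements.
            combined[inter] = combined.get(inter, 0.0) + mb * mc
        else:
            # Empty intersection: this product is conflicting mass.
            conflict += mb * mc
    if conflict >= 1.0:
        raise ValueError("total conflict: Dempster's rule is undefined")
    # Normalization step: the conflicting mass is redistributed proportionally.
    return {A: v / (1.0 - conflict) for A, v in combined.items()}


# Illustrative two-class frame and BBAs (values chosen for this sketch only).
OMEGA = frozenset({"a", "b"})
m1 = {frozenset({"a"}): 0.6, OMEGA: 0.4}
m2 = {frozenset({"b"}): 0.5, OMEGA: 0.5}

result = dempster(m1, m2)
for focal, mass in result.items():
    print(sorted(focal), round(mass, 4))
```

Here the conflicting mass 0.6 x 0.5 = 0.3 is removed and the remaining masses are divided by 0.7, which is exactly the normalization whose behavior in highly conflicting cases is criticized above.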

III. New method for classification of incomplete patterns

The new prototype-based credal classification (PCC) method provides multiple possible estimations of the missing values according to the class prototypes obtained from the training samples. For a c-class problem, it produces c probable estimations. The object with each estimation is classified using any standard classifier (see footnote 2). This yields c classification results, but these results carry different weighting factors depending on the distance between the object and the corresponding prototype. So the c classification results are discounted with different weights, and the discounted results are globally fused for the credal classification of the object. If the c classification results are quite consistent on the class of the object, the fusion result naturally commits the object to the specific class supported by the classification results. However, high conflict among the c classification results can occur, which indicates that the class of the object is quite imprecise (ambiguous) when based only on the known attribute values. In such a conflicting case, it becomes very difficult to classify the object correctly into a particular (specific) class, and it is more prudent and reasonable to assign the object to a meta-class (partially imprecise class) in order to reduce the misclassification rate. By doing this, PCC is able to reveal the imprecision of the classification due to the missing values, which is a useful property. Indeed, in some applications, especially those related to defense and security (such as target classification), robust credal classification results are usually preferable to precise classification results that carry a high risk of error. The classification of an uncertain object in a meta-class can eventually be made more precise (refined) using other (costly) techniques or extra information sources if really necessary. So the PCC approach prevents us from taking erroneous, potentially fatal decisions by reducing the specificity of the classification result whenever necessary.

A. Determination of c estimations of missing values in incomplete patterns

Let us consider a test data set $X = \{x_1, \ldots, x_N\}$ to be classified using the training data set $Y = \{y_1, \ldots, y_H\}$ in the frame of discernment $\Omega = \{\omega_1, \ldots, \omega_c\}$. Because we focus on the classification of incomplete data (test samples) in this work, we assume that the test samples are all incomplete data (vectors) with single or multiple missing values, and that the training data set $Y$ consists of complete patterns.

2 In our context, we call standard a classifier working with complete patterns.

The prototypes of the classes, i.e. $\{o_1, \ldots, o_c\}$, are first calculated using the training data, where $o_g$ corresponds to class $\omega_g$. There exist many methods to produce the prototypes. For example, the K-means method can be applied to each class of the training data, and the clustering center is chosen as the prototype. The simple arithmetic average vector of the training data in each class can also be considered as the prototype, and this method is adopted here for its simplicity. Mathematically, the prototype is computed for $g = 1, \ldots, c$ by

$$o_g = \frac{1}{T_g} \sum_{y_j \in \omega_g} y_j \qquad (4)$$

where $T_g$ is the number of training samples in the class $\omega_g$.
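The class-mean prototype of Eq. (4) can be sketched as follows in Python. The function name `class_prototypes`, the list-of-lists data layout, and the toy training samples are assumptions made for this illustration.

```python
from collections import defaultdict


def class_prototypes(Y, labels):
    """Return {class label: mean vector of that class's training samples} (Eq. 4)."""
    groups = defaultdict(list)
    for y, g in zip(Y, labels):
        groups[g].append(y)
    return {
        # zip(*rows) iterates over attribute dimensions; len(rows) is T_g.
        g: [sum(col) / len(rows) for col in zip(*rows)]
        for g, rows in groups.items()
    }


# Toy complete training patterns (illustrative values only).
Y = [[1.0, 2.0], [3.0, 4.0], [10.0, 10.0], [12.0, 14.0]]
labels = [0, 0, 1, 1]

protos = class_prototypes(Y, labels)
print(protos)
```

Any other prototype generator mentioned in the text (for example a per-class K-means center) could replace the mean here without changing the rest of the procedure.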

Once each class prototype is obtained, we use the value of the prototype to fill the missing values of the object (incomplete pattern) in the same attribute dimension. Because c possible classes with their prototypes are considered, one gets c versions of estimated values for the object. For the object $x_i$ with some unknown (missing) component values, the c versions of estimations of the missing component values $x_{ij}$ of $x_i$ are given by

$$x_{ij}^g = o_{gj} \qquad (5)$$

where $o_{gj}$ is the j-th component of the prototype $o_g$, $g = 1, 2, \ldots, c$.
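The prototype-based filling of Eq. (5) can be sketched as below. Encoding a missing value as `None`, the function name `prototype_estimations`, and the prototype values are assumptions of this illustration.

```python
def prototype_estimations(x, prototypes):
    """Fill the missing entries (None) of x with each class prototype (Eq. 5).

    Returns {class label: completed copy of x}, i.e. one estimation per class.
    """
    return {
        g: [o_j if x_j is None else x_j for x_j, o_j in zip(x, o)]
        for g, o in prototypes.items()
    }


# Illustrative prototypes for a 3-class, 2-D problem (values assumed).
prototypes = {"w1": [3.0, 3.0], "w2": [13.0, 3.0], "w3": [8.0, 8.0]}

# Incomplete pattern: the second attribute is missing.
x = [6.5, None]

estimations = prototype_estimations(x, prototypes)
for g, x_g in estimations.items():
    print(g, x_g)
```

Known attribute values are never overwritten, so the c completed vectors differ only in the dimensions that were missing.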

From each complete estimated vector $x_i^g$, $g = 1, 2, \ldots, c$, we can draw a classification result using any standard classifier working with complete patterns. At this step, the choice of the classifier, denoted $\Gamma(.)$, is left to the user's preference. For instance, one can use for $\Gamma(.)$ the artificial neural network (ANN) approach, or the EK-NN, etc. The c sub-classification results for $x_i$ are given for $g = 1, \ldots, c$ by

$$P_i^g = \Gamma(x_i^g | Y) \qquad (6)$$

where $\Gamma(.)$ represents the chosen classifier, and $P_i^g$ is the output (i.e. classification result) of the classifier when using the prototype of class $\omega_g$ to fill the incomplete pattern $x_i$. $P_i^g$ can be a Bayesian BBA if the chosen classifier works under the probability framework (e.g. K-NN, ANN), and it can also be a regular BBA with some mass of belief committed to the ignorant class $\Omega$ if the classifier works under the belief functions framework (e.g. EK-NN).

In this new PCC approach, we propose to combine these c classification results in order to get a credal classification of the incomplete pattern. These c classification results are considered as c distinct sources of evidence. Because the distances between the object and the c prototypes are usually different, a discounting technique must be applied to weight differently the impact of these sources of evidence in the global fusion process. If the distance of the object to a prototype, measured on the known attribute values, is large, the estimation of the missing values using this prototype is not very reliable. So a bigger distance $d_{ig}$ usually leads to a smaller discounting factor $\alpha_i^g$. A rational way that has been widely applied in many works is adopted here to first estimate the weighting factor $w_i^g$. For $g = 1, \ldots, c$, this factor $w_i^g$ is defined by



$$w_i^g = e^{-d_{ig}} \qquad (7)$$

where

$$d_{ig} = \sqrt{\frac{1}{p} \sum_{s=1}^{p} \left( \frac{x_{is} - o_{gs}}{\delta_{gs}} \right)^2} \qquad (8)$$

with

$$\delta_{gs} = \sqrt{\frac{1}{T_g} \sum_{y_i \in \omega_g} (y_{is} - o_{gs})^2} \qquad (9)$$

$x_{is}$ is the value of $x_i$ in the s-th dimension, and $y_{is}$ is the value of $y_i$ in the s-th dimension. $p$ is the number of dimensions of known values of $x_i$. The coefficient $1/p$ is necessary to normalize the distance value because each test sample can have a different number of missing dimensions. $\delta_{gs}$ is the average distance of all training data belonging to class $\omega_g$ to the prototype $o_g$ in the s-th dimension, and it is introduced mainly to deal with anisotropic data sets. $T_g$ is the number of training samples in the class $\omega_g$.
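The weighting factor can be sketched in Python as follows. The sketch assumes that Eqs. (7)-(9) read as $w_i^g = e^{-d_{ig}}$, with $d_{ig}$ a per-dimension normalized Euclidean distance over the $p$ known dimensions and $\delta_{gs}$ the root mean squared deviation of the class samples from the prototype in dimension $s$. The function names, the `None` encoding of missing values, and the toy data are assumptions of this illustration.

```python
import math


def dimension_spread(class_samples, prototype):
    """delta_gs: root mean squared deviation from the prototype, per dimension (Eq. 9)."""
    T_g = len(class_samples)
    return [
        math.sqrt(sum((y[s] - prototype[s]) ** 2 for y in class_samples) / T_g)
        for s in range(len(prototype))
    ]


def weighting_factor(x, prototype, delta):
    """w = exp(-d), with d computed on the known dimensions of x only (Eqs. 7-8).

    Assumes x has at least one known value and delta[s] > 0 on those dimensions.
    """
    known = [s for s, v in enumerate(x) if v is not None]
    p = len(known)
    d = math.sqrt(
        sum(((x[s] - prototype[s]) / delta[s]) ** 2 for s in known) / p
    )
    return math.exp(-d)


# Toy class with two training samples (illustrative values only).
class_samples = [[1.0, 0.0], [3.0, 4.0]]
prototype = [2.0, 2.0]  # their arithmetic mean

delta = dimension_spread(class_samples, prototype)
w = weighting_factor([4.0, None], prototype, delta)  # second attribute missing
print(delta, w)
```

Dividing by `delta[s]` rescales each dimension by the spread of the class, which is how the anisotropy of the data is taken into account in this reading of the formulas.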



From these weighting factors $w_i^g$ for $g = 1, \ldots, c$, one then defines the relative reliability factors (discounting factors) $\alpha_i^g$ by

$$\alpha_i^g = \frac{w_i^g}{w_i^{max}} \qquad (10)$$

where $w_i^{max} = \max(w_i^1, \ldots, w_i^c)$.

The discounting method proposed by Shafer in [8] is applied here to discount the BBA of each source of evidence according to the factors $\alpha_i^g$. More precisely, the discounted masses of belief are obtained for $g = 1, \ldots, c$ by

$$\begin{cases} m_i^g(A) = \alpha_i^g P_i^g(A), & A \subset \Omega \\ m_i^g(\Omega) = 1 - \alpha_i^g + \alpha_i^g P_i^g(\Omega) \end{cases} \qquad (11)$$



In Eq. (11), the focal element $A$ usually represents a specific class in $\Omega$ because most classical classifiers work within the probability framework only, and thus consider only specific classes as admissible solutions of the classification. Nevertheless, some classifiers based on DST, like EK-NN, can generate results on specific classes and also on the full ignorant class. $P_i^g(A)$ is the probability (or belief mass) committed to the class $A$ by the chosen classifier.
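The relative reliability factors of Eq. (10) and the discounting of Eq. (11) can be sketched together in Python. The function names and the dictionary-of-frozensets representation are assumptions of this illustration; the classifier output and the factor 0.9 used in the usage lines are those of the second source in Example 1 of this section, while the weights passed to `reliability_factors` are arbitrary.

```python
def reliability_factors(weights):
    """alpha^g = w^g / w^max for every class g (Eq. 10)."""
    w_max = max(weights.values())
    return {g: w / w_max for g, w in weights.items()}


def discount(P, alpha, omega):
    """Shafer's discounting (Eq. 11).

    Every mass on a proper subset of omega is scaled by alpha, and the
    remaining mass 1 - alpha is added to the ignorant class omega.
    """
    m = {A: alpha * mass for A, mass in P.items() if A != omega}
    m[omega] = 1.0 - alpha + alpha * P.get(omega, 0.0)
    return m


OMEGA = frozenset({"w1", "w2", "w3"})
W1, W2, W3 = (frozenset({c}) for c in ("w1", "w2", "w3"))

# Arbitrary weights: the most reliable source always gets alpha = 1.
alphas = reliability_factors({"w1": 0.5, "w2": 0.25})

# Second classification result of Example 1, discounted with alpha = 0.9.
P2 = {W1: 0.1, W2: 0.8, W3: 0.1}
m2 = discount(P2, 0.9, OMEGA)
for focal, mass in m2.items():
    print(sorted(focal), round(mass, 4))
```

A source with alpha = 1 is left unchanged, and a source with alpha close to 0 becomes almost vacuous, so it hardly influences the subsequent fusion.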



B. Fusion of the c discounted classification results 

The c classification results obtained according to the c prototypes may strongly support different classes for the object. For instance, several sources of evidence could strongly support that the object is most likely in class A, whereas others could strongly support class B, with $A \cap B = \emptyset$. In practice, some conflict usually exists in the global fusion process. The maximum of the belief function $Bel(.)$ given in Eq. (1) is used as the criterion (see footnote 3) for deciding which class is strongly supported by each classification result, and the c results can be divided into several distinct groups $G_1, G_2, \ldots, G_r$ according to the classes they strongly support.

The classification results in the same group are combined first, and these sub-combination results are then globally fused for the credal classification. The classification results in the same group are generally not in high conflict. Therefore, we propose to apply DS rule (3) to fuse them, since DS rule offers a reasonable compromise between the specificity of the result and the complexity of the combination.

For $G_s = \{m_i^j, \ldots, m_i^k\}$, the fusion results of the BBAs in the group $G_s$ using DS rule are given for a focal element $A \in 2^\Omega$ by:

$$m_i^{\omega_s}(A) = [m_i^j \oplus \ldots \oplus m_i^k](A) \qquad (12)$$

where $\oplus$ represents the DS combination defined in Eq. (3). Since DS rule is associative, these BBAs can be combined sequentially using Eq. (3), and the sequential order does not matter.
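The grouping step and the within-group combination of Eq. (12) can be sketched as follows. The helper names are hypothetical, and the sketch assumes that the focal elements of each discounted BBA are singletons or the whole frame, so that the belief of a singleton equals its mass. The three BBAs are the discounted results of Example 1 given later in this section.

```python
from functools import reduce
from itertools import product


def dempster(m1, m2):
    """Dempster's rule (Eq. 3) for BBAs stored as dict: frozenset -> mass."""
    combined, conflict = {}, 0.0
    for (B, mb), (C, mc) in product(m1.items(), m2.items()):
        if B & C:
            combined[B & C] = combined.get(B & C, 0.0) + mb * mc
        else:
            conflict += mb * mc
    return {A: v / (1.0 - conflict) for A, v in combined.items()}


def strongest_class(m):
    """Singleton with the largest belief (equal to its mass for a singleton)."""
    singletons = {A: v for A, v in m.items() if len(A) == 1}
    return max(singletons, key=singletons.get)


def combine_by_group(bbas):
    """Group the BBAs by the class they support most, then fuse each group (Eq. 12)."""
    groups = {}
    for m in bbas:
        groups.setdefault(strongest_class(m), []).append(m)
    return {cls: reduce(dempster, members) for cls, members in groups.items()}


OMEGA = frozenset({"w1", "w2", "w3"})
W1, W2, W3 = (frozenset({c}) for c in ("w1", "w2", "w3"))

# Discounted BBAs of Example 1.
m1 = {W1: 0.8, W2: 0.2}
m2 = {W1: 0.09, W2: 0.72, W3: 0.09, OMEGA: 0.1}
m3 = {W1: 0.15, W2: 0.06, W3: 0.09, OMEGA: 0.7}

fused = combine_by_group([m1, m2, m3])
for cls, m in fused.items():
    print(sorted(cls), {tuple(sorted(A)): round(v, 4) for A, v in m.items()})
```

With these inputs the first and third BBAs fall in the group supporting w1 and the second one forms the group supporting w2, and the fused masses of the first group agree with the values 0.8173 and 0.1827 reported in Example 1.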

These sub-combined BBAs $m_i^{\omega_s}(.)$, for $s = 1, \ldots, r$, are then globally fused to get the final BBA of the credal classification. In the global fusion process, the sub-combination results of the different groups can be in high conflict because of the distinct classes they strongly support according to their belief functions. Because DS rule is known to produce counter-intuitive results, especially in highly conflicting situations [26], due to its way of redistributing the conflicting beliefs, we propose to use another fusion rule to circumvent this problem. We recall that in DS rule the conflicting masses of belief are redistributed to all focal elements by the classical normalization step of Eq. (3). In our context, the partial conflicting information is very important to characterize the degree of uncertainty and imprecision of the classification caused by the missing values, and it should be preserved and transferred to the corresponding meta-classes, especially in highly conflicting situations. But if all the partial conflicts are unconditionally kept in the fusion results, they generate a high degree of imprecision, which is not an efficient solution of the classification. To avoid this drawback, the PCC approach makes a compromise between the misclassification error rate and the imprecision degree we want to tolerate. This compromise is obtained by selecting the conflicting beliefs that need to be transferred to the corresponding meta-classes. The selection is done conditionally, according to the current context, following the method explained in the sequel.

For simplicity and notational convenience, we assume that the sub-combined BBA of group $G_s$ is focused on the class $\omega_s$, that is, $Bel_i^{\omega_s}(\omega_s) = \max(Bel_i^{\omega_s}(.))$, where $Bel_i^{\omega_s}(.)$ is computed from the BBA $m_i^{\omega_s}(.)$ using Eq. (1), for $s = 1, \ldots, r$. This indicates that $\omega_s$ is strongly supported by the BBAs in group $G_s$. Moreover, the class $\omega_{max}$ is the most believed class of the object if one has

$$Bel_i^{\omega_{max}}(\omega_{max}) = \max(Bel_i^{\omega_1}(\omega_1), \ldots, Bel_i^{\omega_r}(\omega_r)) \qquad (13)$$

We recall that $\omega_{max}$ is the class having the biggest $Bel(.)$ value among all the classification groups, whereas $\omega_s$, $s = 1, \ldots, r$, just takes the biggest $Bel(.)$ value in the group $G_s$.

3 The plausibility function $Pl(.)$ can also be used here, since $Bel(.)$ and $Pl(.)$ have a direct correspondence for BBAs of this particular structure.

In practice, however, the belief $Bel_i^{\omega_s}(\omega_s)$ of the strongest class of the group $G_s$ can be very close (or equal) to $Bel_i^{\omega_{max}}(\omega_{max})$ even though $\omega_s$ differs from $\omega_{max}$. When this occurs, the object can potentially belong to the other class $\omega_s$ with a high likelihood. So we must consider all the very likely specific classes as potential solutions of the classification of the object $x_i$. The set of these potential classes is denoted $\Lambda_i$ and is defined by

$$\Lambda_i = \{\omega_s \mid Bel_i^{\omega_{max}}(\omega_{max}) - Bel_i^{\omega_s}(\omega_s) < \epsilon\} \qquad (14)$$

where $\epsilon \in [0, 1]$ is a chosen threshold. Because all classes in $\Lambda_i$ can very likely correspond to the real (unknown) class of $x_i$, they are not very distinguishable with respect to the chosen threshold $\epsilon$. This means that classifying the object $x_i$ into only one specific class of $\Lambda_i$ is very risky, because all elements of $\Lambda_i$ must in fact be considered acceptable. To reduce the misclassification errors, we propose to keep all the subsets of $\Lambda_i$ in the fusion process and to deal with the meta-classes involved.

If the beliefs of the other classes (e.g. $\omega_f$) are all much smaller than $Bel_i^{\omega_{max}}(\omega_{max})$, i.e. $Bel_i^{\omega_{max}}(\omega_{max}) - Bel_i^{\omega_f}(\omega_f) > \epsilon$, it means that the class $\omega_{max}$ is clearly distinct for the object with respect to the other classes (e.g. $\omega_f$). Then there is no need to keep the meta-class, and one can just use the specific classes in such a case.
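The selection of the potential classes in Eq. (14) amounts to a simple threshold test, sketched below. The function name `potential_classes` and the dictionary layout (group class mapped to the belief of that class in its own group) are assumptions of this illustration; the belief values are those of Example 1 given later in this section.

```python
def potential_classes(group_beliefs, eps):
    """Lambda_i of Eq. (14): classes whose belief is within eps of the largest one.

    group_beliefs maps each group's class w_s to Bel^{w_s}(w_s).
    """
    best = max(group_beliefs.values())
    return {cls for cls, b in group_beliefs.items() if best - b < eps}


# Beliefs of the strongest class of each group in Example 1.
beliefs = {"w1": 0.8173, "w2": 0.72}

print(potential_classes(beliefs, eps=0.3))   # both classes are kept
print(potential_classes(beliefs, eps=0.05))  # only the most believed class is kept
```

The most believed class always belongs to the returned set, since its difference to the maximum is zero; a meta-class is only involved when the set contains at least two classes.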



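The global fusion rule for these sub-combination results is defined, $\forall B_j \subseteq \Omega$, by:

$$\tilde{m}_i(A) = \begin{cases} \sum\limits_{\bigcap_{s=1}^{r} B_s = A} m_i^{\omega_1}(B_1) \cdots m_i^{\omega_r}(B_r), & \text{for } A \subseteq \Omega \text{ with } |A| = 1, \text{ or } A = \Omega \\ \sum\limits_{\bigcup_{j=1}^{|A|} B_j = A} \left[ m_i^{\omega_1}(B_1) \cdots m_i^{\omega_{|A|}}(B_{|A|}) \prod\limits_{g=|A|+1}^{r} m_i^{\omega_g}(\Omega) \right], & \text{for } A \subseteq \Lambda_i \text{ with } |A| \geq 2 \end{cases} \qquad (15)$$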

In Eq. (15), $r$ is the number of groups of classification results. $|A|$ is the cardinality of the hypothesis $A$, and it is equal to the number of singleton elements included in $A$. For example, if $A = \omega_i \cup \omega_j$, then $|A| = 2$. The conjunctive combination, which corresponds to the consensus of the sub-combination results, is used in the first part of the formula to calculate the mass of belief of the specific classes and of the ignorant class (see footnote 4). In the second part of Eq. (15), the partial conflicting beliefs are committed to the selected meta-classes to reflect the imprecision degree of the classification of the object among the specific classes included in the meta-class.

Because not all partial conflicting masses of belief are 
transferred into the meta-classes through the global fusion 
formula (15), the combined BBA is normalized as follows 
before making a decision: 



$$m_i(A) = \frac{\tilde{m}_i(A)}{\sum_{B \subseteq \Omega} \tilde{m}_i(B)} \qquad (16)$$



The credal classification of the object can be made directly from this final normalized combined BBA, and the object is assigned to the focal element (a class or a meta-class) with the maximal mass of belief. The maximum of belief $Bel_i(.)$ of the singleton (specific) class, the maximum of plausibility $Pl_i(.)$, or the maximum of pignistic probability $BetP_i(.)$ drawn from the global combined BBA $m_i(.)$ is usually used as the criterion for making a hard classification, but hard classification is not recommended in such an uncertain case. The credal classification based on the BBA is preferred here since it reflects the inherent imprecision (ambiguity) degree of the classification due to the missing values.

Guideline for choosing the meta-class threshold $\epsilon$: In applications, the threshold $\epsilon$ of PCC must be tuned according to the number of objects in meta-classes. A small $\epsilon$ generally leads to fewer objects in meta-classes, but it may cause more misclassifications of the uncertain objects. A large $\epsilon$ yields more objects in meta-classes and leads to a higher imprecision degree, which is not an efficient solution for the classification. So $\epsilon$ should be tuned according to the imprecision degree of the fusion results that one accepts.



The following simple example shows how PCC works.

Example 1: Let us consider a 3-D object $x_i = [x_{i1}, ?, ?]$ with missing values in the 2nd and 3rd dimensions, to be classified over the frame of classes $\Omega = \{\omega_1, \omega_2, \omega_3\}$. It is assumed that the prototypes $O = \{o_1, o_2, o_3\}$ of the three classes have been calculated using the training data as:

$o_1 = [o_{11}, o_{12}, o_{13}]$
$o_2 = [o_{21}, o_{22}, o_{23}]$
$o_3 = [o_{31}, o_{32}, o_{33}]$

So the three versions of the object with estimated missing values are:

$x_i^1 = [x_{i1}, o_{12}, o_{13}]$
$x_i^2 = [x_{i1}, o_{22}, o_{23}]$
$x_i^3 = [x_{i1}, o_{32}, o_{33}]$

The three estimated patterns are each classified using a standard classifier, and the classification results represented by the probability memberships are given by:

$P_i^1(\omega_1) = 0.8$, $P_i^1(\omega_2) = 0.2$
$P_i^2(\omega_1) = 0.1$, $P_i^2(\omega_2) = 0.8$, $P_i^2(\omega_3) = 0.1$
$P_i^3(\omega_1) = 0.5$, $P_i^3(\omega_2) = 0.2$, $P_i^3(\omega_3) = 0.3$

The relative weighting factor of each classification result is calculated from the distance between $x_i$ and the three prototypes using Eq. (10). For simplicity and convenience, they have been arbitrarily chosen as follows for this example:

$\alpha_i^1 = 1$, $\alpha_i^2 = 0.9$, $\alpha_i^3 = 0.3$

Then each classification result $P_i^k(.)$, $k = 1, \ldots, 3$, can be discounted using Eq. (11), and the discounted BBAs are given by

$m_i^1(\omega_1) = 0.8$, $m_i^1(\omega_2) = 0.2$
$m_i^2(\omega_1) = 0.09$, $m_i^2(\omega_2) = 0.72$, $m_i^2(\omega_3) = 0.09$, $m_i^2(\Omega) = 0.1$
$m_i^3(\omega_1) = 0.15$, $m_i^3(\omega_2) = 0.06$, $m_i^3(\omega_3) = 0.09$, $m_i^3(\Omega) = 0.7$

Because of the particular choice of $\alpha_i^1 = 1$, the BBA $m_i^1(.)$ is not discounted in this example.

The belief functions $Bel_i^k(.)$ corresponding to each BBA $m_i^k(.)$ are obtained using Eq. (1) and are given by

$Bel_i^1(\omega_1) = 0.8$, $Bel_i^1(\omega_2) = 0.2$
$Bel_i^2(\omega_1) = 0.09$, $Bel_i^2(\omega_2) = 0.72$, $Bel_i^2(\omega_3) = 0.09$
$Bel_i^3(\omega_1) = 0.15$, $Bel_i^3(\omega_2) = 0.06$, $Bel_i^3(\omega_3) = 0.09$

Among the singleton (specific) classes, $m_i^1(.)$ and $m_i^3(.)$ put the most belief on class $\omega_1$, whereas $m_i^2(.)$ commits most of its mass to the class $\omega_2$. It means that the object likely belongs to class $\omega_1$ with the estimations from prototypes $o_1$ and $o_3$, but it is very probably classified into $\omega_2$ with the estimation from $o_2$. This uncertainty (conflict) is mainly caused by the lack of discriminant information due to the missing values. The three BBAs can then be divided into the two following groups: $G_1 = \{m_i^1(.), m_i^3(.)\}$ and $G_2 = \{m_i^2(.)\}$.

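The sub-combination results of each group of BBAs using DS rule (3) are:

$m_i^{\omega_1}(.)$: $m_i^{\omega_1}(\omega_1) = 0.8173$, $m_i^{\omega_1}(\omega_2) = 0.1827$
$m_i^{\omega_2}(.)$: $m_i^{\omega_2}(\omega_1) = 0.09$, $m_i^{\omega_2}(\omega_2) = 0.72$, $m_i^{\omega_2}(\omega_3) = 0.09$, $m_i^{\omega_2}(\Omega) = 0.1$.

Then one gets $Bel_i^{\omega_{max}}(\omega_{max}) = Bel_i^{\omega_1}(\omega_1) = 0.8173$ and $Bel_i^{\omega_2}(\omega_2) = 0.72$. If the meta-class threshold is chosen as $\epsilon = 0.3$, we get $Bel_i^{\omega_1}(\omega_1) - Bel_i^{\omega_2}(\omega_2) < \epsilon$, and thus $\Lambda_i = \{\omega_1, \omega_2\}$. So the meta-class $\omega_1 \cup \omega_2$ is kept, and the conflicting mass of belief produced by the conjunctive combination, $m_i^{\omega_1}(\omega_1) m_i^{\omega_2}(\omega_2) + m_i^{\omega_1}(\omega_2) m_i^{\omega_2}(\omega_1)$, is transferred to $\omega_1 \cup \omega_2$.

The global fusion of the BBAs $m_i^{\omega_1}(.)$ and $m_i^{\omega_2}(.)$ using Eq. (15) yields the following unnormalized combined BBA

$\tilde{m}_i(.)$: $\tilde{m}_i(\omega_1) = 0.1553$, $\tilde{m}_i(\omega_2) = 0.1498$, $\tilde{m}_i(\omega_1 \cup \omega_2) = 0.6049$.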



4 The ignorant class represents the outlier (noisy) class. 



As we see, the BBA $\tilde{m}_i(.)$ is not a normalized BBA because some conflicting masses of belief are deliberately excluded from the redistribution to the meta-classes. After the normalization step, we finally get:

$m_i(.)$: $m_i(\omega_1) = 0.1707$, $m_i(\omega_2) = 0.1646$, $m_i(\omega_1 \cup \omega_2) = 0.6647$.

One sees that the biggest mass of belief is committed to the meta-class $\omega_1 \cup \omega_2$. This result indicates that the classes $\omega_1$ and $\omega_2$ are not very distinguishable based only on the known attribute information, and the object quite likely belongs to $\omega_1$ or $\omega_2$ according to the different estimations of the missing values. In this simple example, it is difficult to commit the object to a particular class. If one had to take a specific class decision, one would very probably make a mistake. So hard classification is not recommended in such a case, and the object is committed to the meta-class $\omega_1 \cup \omega_2$ by the PCC approach, which is a prudent and reasonable behavior consistent with intuitive reasoning. Some additional sources (if available) need to be used and combined with the available information to get a more precise classification result.
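The global fusion step of Example 1 can be checked numerically with the short Python sketch below. It is restricted to the case of r = 2 groups and assumes that the focal elements of the sub-combined BBAs are singletons or the whole frame, as in the example; it is not a general implementation of Eq. (15). The function name `global_fusion_two_groups` and the data layout are assumptions of this illustration.

```python
from itertools import product


def global_fusion_two_groups(m_a, m_b, lam):
    """Global fusion (Eq. 15) and normalization (Eq. 16) for r = 2 groups.

    m_a, m_b : sub-combined BBAs (dict: frozenset -> mass)
    lam      : frozenset of the potential classes (Lambda_i of Eq. 14)
    Returns (unnormalized BBA, normalized BBA).
    """
    unnormalized = {}
    for (B, mb), (C, mc) in product(m_a.items(), m_b.items()):
        if B & C:
            target = B & C            # consensus: conjunctive combination
        elif (B | C) <= lam:
            target = B | C            # selected partial conflict -> meta-class
        else:
            continue                  # remaining conflicting mass is discarded
        unnormalized[target] = unnormalized.get(target, 0.0) + mb * mc
    total = sum(unnormalized.values())
    normalized = {A: v / total for A, v in unnormalized.items()}
    return unnormalized, normalized


OMEGA = frozenset({"w1", "w2", "w3"})
W1, W2, W3 = (frozenset({c}) for c in ("w1", "w2", "w3"))
W12 = W1 | W2

# Sub-combination results of the two groups in Example 1.
m_g1 = {W1: 0.8173, W2: 0.1827}
m_g2 = {W1: 0.09, W2: 0.72, W3: 0.09, OMEGA: 0.1}

unnorm, norm = global_fusion_two_groups(m_g1, m_g2, lam=W12)
for focal, mass in norm.items():
    print(sorted(focal), round(mass, 4))

decision = max(norm, key=norm.get)
print("decision:", sorted(decision))
```

Under these assumptions the sketch recovers, up to rounding, the unnormalized and normalized masses reported in the example, and the focal element with maximal mass is the meta-class formed by w1 and w2.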

IV. Application of new method 

Two experiments have been carried out to test and evaluate the performance of the new PCC method. The performance of PCC is compared with that of the mean imputation (MI) method [6] and the K-NN imputation (KNNI) method [7]. The EK-NN classifier [13] is adopted here as the standard classifier to classify the test samples with the estimated values in PCC, MI and KNNI, because EK-NN produces good classification results (see footnote 5). The parameters of EK-NN were automatically optimized using the method proposed in [27]. In order to show the ability of PCC to deal with the meta-classes, the class of each object is decided according to the criterion of the maximal mass of belief. In the applications of PCC, the parameter $\epsilon$ can be tuned automatically according to the imprecision rate one can accept.

In our simulations, a misclassification is declared (counted) for an object truly originating from $\omega_i$ if it is classified into $A$ with $\omega_i \cap A = \emptyset$. If $\omega_i \cap A \neq \emptyset$ and $A \neq \omega_i$, it is considered an imprecise classification. The error rate, denoted by $Re$, is calculated by $Re = N_e / T$, where $N_e$ is the number of misclassification errors and $T$ is the number of objects under test. The imprecision rate, denoted by $Ri_j$, is calculated by $Ri_j = Ni_j / T$, where $Ni_j$ is the number of objects committed to the meta-classes with the cardinality value $j$.
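The two rates defined above can be computed with the following Python sketch. It assumes that each decision is stored as a frozenset of class labels (a singleton for a specific class, a larger set for a meta-class) and that an object is counted as imprecise only when its meta-class contains the true class, following the definition above; the function name and the toy data are hypothetical.

```python
from collections import Counter


def error_and_imprecision_rates(true_classes, decisions):
    """Return (Re, {j: Ri_j}) for credal decisions given as frozensets of labels."""
    T = len(true_classes)
    errors, imprecise = 0, Counter()
    for truth, A in zip(true_classes, decisions):
        if truth not in A:
            errors += 1               # true class and A are disjoint: error
        elif len(A) > 1:
            imprecise[len(A)] += 1    # true class inside a meta-class: imprecise
    return errors / T, {j: n / T for j, n in imprecise.items()}


# Toy evaluation set (illustrative labels only).
truths = ["w1", "w1", "w2", "w3"]
decisions = [
    frozenset({"w1"}),          # correct specific decision
    frozenset({"w1", "w3"}),    # imprecise, cardinality 2
    frozenset({"w1"}),          # error
    frozenset({"w2", "w3"}),    # imprecise, cardinality 2
]

Re, Ri = error_and_imprecision_rates(truths, decisions)
print("Re =", Re, " Ri =", Ri)
```

Multiplying the returned values by 100 gives the percentages used in the figure captions of the experiments.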

5 In fact, many other standard classifiers can be applied here according to
the actual request.

A. Experiment 1

This experiment illustrates the use of the credal
classification obtained by PCC with respect to other classical
methods. We consider a particular 3-class data set $\Omega =
\{\omega_1, \omega_2, \omega_3\}$ in the circular shape as shown in
Fig. 1-(a). Each class contains 305 training samples and 305 test
samples. Thus, we consider $3 \times 305 = 915$ training samples
and $3 \times 305 = 915$ test samples. The radius of each circle
is $r = 3$, and the centers of the three circles are given by the
points $c_1 = (3,3)^T$, $c_2 = (13,3)^T$, $c_3 = (8,8)^T$, where
$T$ denotes the transpose. The values in the second dimension,
corresponding to the y-coordinate of the test samples, are all
missing, so only one value, the x-coordinate in the first
dimension, is known for each test sample. Two meta-class
selection thresholds, $\epsilon = 0.3$ and $\epsilon = 0.45$, have
been applied in PCC to show their influence on the results. A
particular value of $K = 9$ is selected in the EK-NN classifier
and in the K-NN imputation⁶. The classification results of the
test objects by the different methods are given in Fig. 1-(b) to
1-(d). For notation conciseness, we have denoted $\omega^{te}
\triangleq \omega^{test}$, $\omega^{tr} \triangleq \omega^{training}$
and $\omega_{i,\ldots,k} \triangleq \omega_i \cup \ldots \cup
\omega_k$. The error rate (in %) and imprecision rate (in %) for
PCC are given in the caption of each subfigure.
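The geometry of this data set can be sketched as below. The text gives the centers, the radius and the sample counts, but not the sampling distribution, so drawing points uniformly inside each disk is an assumption; the helper names are hypothetical. The sketch also computes the x-intervals shared by two classes, which follow directly from the centers and the radius.

```python
import math
import random

CENTERS = {"w1": (3.0, 3.0), "w2": (13.0, 3.0), "w3": (8.0, 8.0)}
RADIUS = 3.0

def sample_disk(center, radius, rng):
    """Draw one point uniformly inside a disk (assumed distribution)."""
    r = radius * math.sqrt(rng.random())
    theta = 2.0 * math.pi * rng.random()
    return (center[0] + r * math.cos(theta), center[1] + r * math.sin(theta))

def make_split(n_per_class, rng):
    """Return a list of ((x, y), label) pairs, n_per_class per class."""
    return [(sample_disk(c, RADIUS, rng), label)
            for label, c in CENTERS.items()
            for _ in range(n_per_class)]

def x_overlap(a, b):
    """Interval of x-values reachable by both classes, or None if disjoint."""
    lo = max(CENTERS[a][0], CENTERS[b][0]) - RADIUS
    hi = min(CENTERS[a][0], CENTERS[b][0]) + RADIUS
    return (lo, hi) if lo < hi else None

rng = random.Random(0)
train = make_split(305, rng)
# Test samples keep only the x-coordinate; the y-coordinate is missing.
test = [((x, None), label) for (x, _y), label in make_split(305, rng)]

print(x_overlap("w1", "w3"))  # (5.0, 6.0)
print(x_overlap("w2", "w3"))  # (10.0, 11.0)
print(x_overlap("w1", "w2"))  # None
```

Test samples whose x-value falls in [5, 6] or [10, 11] are compatible with two classes, which is where an ambiguous decision is to be expected when only x is known.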




(c) Classification result by the method with K-NN estimation ($Re = 4.15$).
(d) Classification result by PCC with $\epsilon = 0.3$ ($Re = 1.75$, $Ri_2 = 4.81$).
(e) Classification result by PCC with $\epsilon = 0.45$ ($Re = 0.87$, $Ri_2 = 8.31$).

Figure 1. Classification results of 3-class data set by different methods.

6 In fact, the choice of K ranging from 7 to 15 does not seriously affect the
results.

The values of the y-coordinate of the test samples are all
missing, and the class of each test sample is determined only
based on the value of the x-coordinate. We can see from Fig. 1-(a)
that the class $\omega_3$ partly overlaps with the classes
$\omega_1$ and $\omega_2$ on their margins with respect to the
x-coordinate. The objects lying in the overlapped zones are
really difficult to classify correctly into a particular class,
since $\omega_1$ and $\omega_3$ (resp. $\omega_2$ and $\omega_3$)
seem indistinguishable for these objects based on the values on
the x-axis only. The mean and K-NN estimation methods provide
only one value for the missing data, and then the EK-NN
classifier is used to classify the test samples with this
estimated value. The objects are all committed to a particular
class by these two methods with a high error rate, and the
results cannot reflect well the uncertainty and imprecision of
classification caused by the missing values. With the PCC
approach, most objects lying in the overlapped zones are
reasonably assigned to the proper meta-classes $\omega_1 \cup
\omega_3$ and $\omega_2 \cup \omega_3$. So PCC is able to reduce
the error rate and to characterize well the imprecision
(ambiguity) of the classification thanks to the use of
meta-classes under the belief functions framework. One can see
that increasing $\epsilon$ decreases the error rate but
meanwhile increases the imprecision rate. So we should find a
good compromise between the error rate and the imprecision rate.
In real applications, $\epsilon$ can be optimized using the
training data, and the optimized value should correspond to a
suitable compromise between the error rate and the imprecision
rate. $\epsilon$ can also be tuned according to the imprecision
rate one can accept in the classification.
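One simple way to make this compromise explicit is sketched below: among candidate thresholds, keep those whose imprecision rate stays under an accepted bound and take the one with the lowest error rate. The function `select_epsilon` and the callable `evaluate` are hypothetical; the paper does not prescribe this particular procedure. The two operating points are the rates reported in the captions of Fig. 1, reused here only as example numbers.

```python
def select_epsilon(candidates, evaluate, max_imprecision):
    """Return the candidate threshold with the lowest error rate among those
    whose imprecision rate does not exceed max_imprecision (rates in percent).

    evaluate(eps) is a hypothetical callable returning (Re, Ri), for example
    measured on held-out training data. Returns None if no candidate fits.
    """
    feasible = []
    for eps in candidates:
        re, ri = evaluate(eps)
        if ri <= max_imprecision:
            feasible.append((re, ri, eps))
    if not feasible:
        return None
    return min(feasible)[2]

# (Re, Ri_2) pairs taken from the captions of Fig. 1-(d) and 1-(e).
rates = {0.3: (1.75, 4.81), 0.45: (0.87, 8.31)}

strict = select_epsilon(rates, rates.get, max_imprecision=5.0)    # 0.3
relaxed = select_epsilon(rates, rates.get, max_imprecision=10.0)  # 0.45
```

A tighter bound on the accepted imprecision selects the smaller threshold, at the cost of a higher error rate.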

B. Experiment 2

We use four real data sets (Breast cancer, Seeds, Yeast
and Wine) available from the UCI Machine Learning
Repository to test the performance of PCC with respect to
MI and KNNI. Three classes (CYT, NUC and ME3) are
selected in the Yeast data set to evaluate our method, since
these three classes are close and difficult to classify. The basic
information of the four data sets is given in Table I, and the
detailed information can be found at
http://archive.ics.uci.edu/ml/.

The k-fold cross validation was performed on the four data
sets by the different classification methods, and k generally
remains a free parameter. We used the simplest 2-fold cross
validation⁷ here, since it has the advantage that the training
and test sets are both large, and each sample is used once for
training and once for testing. Each test sample has n
missing (unknown) values, and they are missing completely
at random in every dimension. The average error rate $Re_a$
and imprecision rate $Ri_a$ (for PCC) of the different
methods with values of K ranging from 5 to 20 are given in
Table II.
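The evaluation protocol can be sketched as follows: a class-wise split into two halves that are used in turn for training and testing, and the removal of n attribute values chosen at random in each test sample. The helper names, the use of `None` as the missing-value marker and the toy data are assumptions made for illustration.

```python
import random

def two_fold_split(labels, rng):
    """Split sample indices into two halves S1 and S2, class by class."""
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    s1, s2 = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        half = len(indices) // 2   # with an odd class size, S2 gets one more
        s1.extend(indices[:half])
        s2.extend(indices[half:])
    return s1, s2

def mask_at_random(sample, n_missing, rng):
    """Return a copy of sample with n_missing attributes replaced by None."""
    hidden = set(rng.sample(range(len(sample)), n_missing))
    return [None if d in hidden else v for d, v in enumerate(sample)]

rng = random.Random(0)
labels = ["a"] * 6 + ["b"] * 4            # toy labels (hypothetical)
s1, s2 = two_fold_split(labels, rng)
folds = [(s1, s2), (s2, s1)]              # (train, test) index pairs
masked = mask_at_random([5.1, 3.5, 1.4, 0.2, 2.0], 3, rng)
```

Each index appears in exactly one half, so over the two folds every sample is used once for training and once for testing.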

7 More precisely, the samples in each class are randomly assigned to two
sets $S_1$ and $S_2$ having equal size. Then we train on $S_1$ and test on
$S_2$, and reciprocally.

Table I
Basic information of the used data sets.

name      classes   attributes   instances
Breast    2         9            699
Seeds     3         7            210
Wine      3         13           178
Yeast     3         8            1050

Table II
Classification results for different real data sets (in %).

data set   n    MI ($Re$)   KNNI ($Re$)   PCC ($\{Re, Ri_2\}$)
Breast     3    4.71        6.10          {4.10, 3.38}
Breast     5    8.20        8.15          {4.38, 4.69}
Breast     7    38.33       14.35         {7.91, 8.05}
Yeast      1    37.59       38.13         {34.36, 6.95}
Yeast      3    45.08       44.29         {34.71, 18.00}
Yeast      5    51.16       50.95         {33.46, 31.01}
Seeds      3    21.03       9.68          {7.14, 3.72}
Seeds      5    33.49       12.54         {9.67, 6.70}
Seeds      6    40.71       25.87         {16.79, 12.77}
Wine       3    30.71       26.59         {26.05, 1.05}
Wine       6    34.93       25.84         {26.62, 0.84}
Wine       10   39.23       30.90         {25.84, 3.86}

The results of Table II show that the PCC method generally
produces a lower error rate than the MI and KNNI classification
methods, but meanwhile it yields some imprecision in the
classification result due to the introduction of meta-classes,
which reflect that some incomplete objects are very difficult to
classify because of a lack of discriminant information.
Increasing the number n of missing values in each test sample
generally increases the error rate of the three classifiers. The
imprecision rate in PCC also generally becomes larger, since
more missing values lead to greater imprecision (uncertainty)
in the classification problem. So the credal classification
including meta-classes is very useful and efficient here to
represent the degree of imprecision, and it also helps to
decrease the misclassification rate. The PCC approach indicates
that the objects in meta-classes are really difficult to classify
correctly, and that they should be treated cautiously in
applications. If one wants more precise results, some other
(possibly costly) techniques seem necessary to discriminate and
classify such uncertain objects.

V. Conclusion 

A new prototype-based credal classification (PCC) method
has been presented in this work for classifying incomplete
patterns within the belief function framework. The PCC
method allows an object (incomplete pattern) to belong to
specific classes and meta-classes (i.e. unions of several specific
classes) with different masses of belief. The meta-class is used
to characterize the imprecision of the classification due to
the missing values, and it can also reduce errors. When the
PCC result indicates that an incomplete pattern belongs to a
meta-class, it means that the specific classes included in the
meta-class are indistinguishable based on the available
attributes. Such an incomplete pattern with uncertain
classification should be treated more cautiously in the
application. If one wants a more precise result, additional
(possibly costly) techniques or information sources must be
developed and used. Several experiments with artificial and
real data sets have been done to evaluate the performance
of PCC with respect to the classical MI and KNNI methods. Our
results show that PCC is able to represent well the imprecision
of classification caused by the missing data, and to reduce the
classification error rate.